[SOUND] This lecture is about the,
the basic measures for
evaluation of text original systems.
In this lecture,
we're going to discuss how we design basic
measures [SOUND] to quantitatively,
compare two original [SOUND] systems.
This is a slide that you have
seen earlier in the lecture,
where we talk about the grand
evaluation methodology.
We can have a test collection that
consists of queries, documents and
relevance judgements.
We can then run two systems on these da,
data sets to,
quantitatively evaluate your performance.
And we raised to the question about,
[SOUND] which settles results is better
is System A better or System B better?
[SOUND] So let's now talk about how to
actually quantify their performance.
Suppose we have a total of,
of 10 random documents in
the current folder for this query.
Now, the relevance judgements
shown on the right,
did not include all the ten obviously.
And we have only seen three
rendered documents there but
we can imagine there are other random
documents in judging for this query.
So now, intuitively we thought that
System A is better because
it did not have much noise.
And in particular we have seen,
amount of three results,
two of them are relevant but
in System B we
have five results and
only three of them are relevant.
So intuitively,
it looks like System A is more accurate.
And this can be captured by
a matching order precision.
Where we simply compute to what extent
all the retrieval results are relevant.
If you have 100% precision that would mean
all the retrieval documents are relevant.
So, in this case the system A has
a Precision of two out of three.
System B as three over five.
And this shows that System A is
better by Precision.
But we also talked about
System B might be preferred by
some other users hold like to retrieve
as many relevant documents as possible.
So, in that case we have to compare
the number of relevant
documents that retrieve.
And there is an other
measure called a Recall.
This measures the completeness of
coverage of relevant documents
in your retriever result.
So, we just assume that there are ten
relevant documents in the collection.
And here we've got two of them in
System A, so the recall is two out of ten.
Where as system B has got a three,
so it's a three out of ten.
Now ,we can see by recall
System B is better and these two
measures turned out to be the very basic
measures for evaluating search engine.
And they are very important because
they are also widely used in many other
testing variation problems.
For example, if you look at the
applications of machine learning you tend
to see precision recall numbers being
reported for all kinds of tasks.
Okay, so now, let's define these
two measures more precisely and
these measures are to evaluate
a set of retrieval documents.
So that means we are considering
that approximation
of a set of relevant documents.
We can distinguish it four cases,
depending on the situation of a document.
A document that can be retrieved or
not retrieved, right?
Because we're talking
about the set of result.
The document can be also relevant or
not relevant, depending on whether
the user thinks this is a useful document.
So, we can now have counts of documents
in each of the four categories.
We can have a to represent the number
of documents that are retrieved and
relevant, b for documents that
are not retrieved but relevant, etc.
Now, with this table,
then we have defined precision.
As the, ratio of, the relevant
retriever documents A to the total
number of retriever documents.
So this is just you know,
a divided by the sum of a and c.
The sum of this column.
Signal recall is defined by
dividing a by the sum of a and b.
So that's, again, to divide a by the sum
of the rule, instead of the column.
All right, so we going to see
precision and recall is all focused on
looking at the a, that's the number
of retrieval relevant documents, but
we're going to use different denominators.
Okay, so what would be an ideal result?
Well, you can able to see in ideal
case we have precision and recall, all
to be 1.0 that means we have got 1% of
all the random documents in our results.
And all the results that
we return are relevant.
[INAUDIBLE] There's no single
not relevant document returned.
The reality however, high recall tends
to be associated with low precision And
you can imagine why that is the case.
As you go down the distant to try to get
as many relevant actions as possible.
You tend to in time a lot of non relevant
documents, so the precision goes down.
Look at this set, can also be defined
by a cutoff in a ranked list.
That's why, although these two measures
are defined for a set of retrieved
documents, they are actually very
useful for evaluating a ranked list.
They are the fundamental measures in
tension retrieval and many other tasks.
We often are interested in to
the precision up to ten documents for
web search.
This means we look at the,
how many documents among the top
results are actually relevant.
Now, this is a very meaningful measure,
because it tells us how many relevant
documents a user can expect to see.
On the first page of search results,
where they typically show ten results.
So, precision and recall are,
the basic measures and
we need to use them to further
evaluate a search engine but
they are the building blocks really.
We just to say that there tends to be
a trade off between precision and recall.
So, naturally it would be interesting
to [SOUND] combine them and
here's one measure that's often used,
called f measure.
And it's harmonic mean of precision and
recall, it's defined on this slide.
So you can see it first computed,
inverse of R and P here and
then it would be
interpreted to by using a co,
coefficients.
Depending on the parameter Beta and
after some transformation we can
easily see it would be of this form.
And in many cases it's just
a combination of precision and recall.
And, and Beta is a parameter
that's often set to one.
It can control the emphasis
on precision or recall.
When we set,
beta to one we end up by having a special
case of F measure, often called F1.
This is a popular measure, that is often
used as a combined precision and recall.
And the formula looks very
simple it's just this, here.
Now it's easy to see that if you have,
a larger precision or
larger recall than F
measure would be high.
But what's interesting is that,
the trade off between precision and
recall, is captured in
an interesting way in F1.
So, in order to understand that, we,
can first look at the natural question.
Why not just the,
combining them using a simple
arithmetic mean as a [INAUDIBLE] here.
That would be likely the most
natural way of combining them.
So, what do you think?
If you want to think more,
you can pause the media.
So why is this not as good as F1?
Or what's the problem with this?
Now, if you think about
the arithmetic mean,
you can see that this is the sum of,
of multiple terms.
In this case,
this is the sum of precision and recall.
In the case of the sum, the total value
tends to be dominated by the large values.
That means if you have a very high P or
a very high R,
then you really don't care about the,
whether the other varies is low.
So, the whole sum would be high.
Now, this is not the desirable because
one can easily have a perfect recall.
We can have a perfect recall is it?
Can you imagine how?
It's probably very easy to imagine that
we simply retrieve all
the document in the collection,
then we have a perfect recall and
this will give us 0.5 as the average.
But search results are clearly
not very useful for users,
even though the, the average using
this formula would be relatively high.
Now, in contrast, you can see F1 will
reward a case where precision and
recall are roughly but similar.
So, it would paralyze a case
where you have extremely high
matter for one of them.
So, this means F1 encodes
a different trade off between that.
Now this example shows actually,
a very important methodology here.
When we try to solve a problem,
you might naturally think of one solution.
Let's say, in this case,
it's this arithmetic mean.
But it's important that not
to settle on this solution.
It's important to think whether you
have other ways to combine them.
And once you think about
the multiple variance.
It's important to analyze
their difference and
then think about which
one makes more sense.
In this case,
if you think more carefully you will feel
that if one problem makes more sense.
Then the simple arithmetic mean.
Although in other cases,
there may be, different results.
But in this case, the arithmetic mean,
seems not reasonable.
But if you don't pay attention
to these subtle differences,
you might just, take an easy way to
combine them and then go ahead with it.
And here later you'll find that, hm,
the measure doesn't seem to work well.
Right so, at this methodology
is actually very important in
general in solving problem and
try to think about the best solution.
Try to understand that the problem,
very well and then know why
you needed this measure, and why you
need to combine precision and recall.
And then use that to guide you in
finding a good way to solve the problem.
To summarize, we talk about precision,
which addresses the question,
are the retrieval results all relevant?
We'll also talk about the recall,
which addresses the question,
have all the relevant
documents been retrieved?
These two are the two basic measures
in testing retrieval in variation.
They are are used for, for
many other tasks as well.
We'll talk about F measure as a way
to combine precision and recall.
We also talked about the trade
off between precision and recall.
And this turns out to depend
on the users search tasks and
we'll discuss this point
more in the later lecture.
[MUSIC]

